System and method for processing text utilizing a suite of disambiguation techniques

ABSTRACT

The invention relates to a system and method for processing natural language text utilizing disambiguation components to identify a disambiguated sense for the text. The method comprises applying a selection of the components to the text to identify a local disambiguated sense for the text. Each component provides a local disambiguated sense of the text with a confidence score and a probability score. The disambiguated sense is determined utilizing a selection of local disambiguated senses. The invention also relates to a system and method for generating sense-tagged text. This method comprises steps of: disambiguating a quantity of documents utilizing a disambiguation component; generating a confidence score and a probability score for a sense identified for a word provided by the component; if the confidence score for the sense for the word is below a set threshold, ignoring the sense; and if the confidence score for the sense for the word is above the set threshold, adding the sense to the sense-tagged text.

RELATED APPLICATION

This application claims the benefit of U.S. Provisional Application No. 60/496,681 filed on Aug. 21, 2003.

FIELD OF THE INVENTION

The present invention relates to disambiguating natural language text, such as queries to an Internet search engine, web pages and other electronic documents, and disambiguating textual output of a speech to text system.

BACKGROUND

Word sense disambiguation is the process of determining the meaning of words in text. For example, the word “bank” can mean a financial institution, an embankment, or an aerial manoeuvre (or several other meanings). When humans listen to or read naturally expressed language, they automatically select the correct meaning of each word based on the context in which it is expressed. A word sense disambiguator is a computer-based system for accomplishing this task, and is a critical component of technology for making naturally expressed language understandable to computers.

A word sense disambiguator is used in applications which require or which can be improved by making use of the meaning of the words in the text. Such applications include but are not limited to: Internet search and other information retrieval applications; document classification; machine translation; and speech recognition.

It is accepted by those skilled in the art that, although humans perform word sense disambiguation effortlessly, and this is a critical step in understanding naturally expressed language, no system has yet been developed to accomplish word sense disambiguation of general texts to an accuracy sufficient to permit deployment in such applications. Even current advanced word sense disambiguation systems may have an accuracy of only approximately 33%, thereby making their results too inaccurate for many applications.

There is a need for a word sense disambiguation system and method which address deficiencies in the prior art.

SUMMARY OF THE INVENTION

In a first aspect, a method of processing natural language text utilizing disambiguation components to identify a disambiguated sense or senses for the text is provided. The method comprises applying a selection of the components to the text to identify a local disambiguated sense for the text. Each component provides a local disambiguated sense of the text with a confidence score and a probability score. The disambiguated sense is determined utilizing a selection of local disambiguated senses.

In the method, the components are sequentially activated and controlled by a central module.

The method may further comprise identifying a second selection of components; and applying the second selection to the text to refine the disambiguated sense (or senses). Each component in the second selection provides a second local disambiguated sense (or senses) of the text with a second confidence score and a second probability score. The disambiguated sense (or senses) is determined utilizing a selection of the second local disambiguated senses.

In the method, after applying the selection to the text and prior to applying the second selection to refine the disambiguated sense (or senses), the further step of eliminating a sense from the disambiguated sense having a confidence score below a threshold may be executed.

In the method, when a particular component is present in the selection and the second selection, its confidence and probability scores may be adjusted when applying the second selection to the text.

In the method, the selection and the second selection of components may be identical.

In the method, the confidence score of each component may be generated by a confidence function utilizing a trait of each component.

After applying the selection of components to the text to identify a local disambiguated sense (or senses) for the text, for each component of the selection, the method may generate a probability distribution for its disambiguated sense (or senses). Further, the method may merge all probability distributions for the selection.

In the method, the selection of components may disambiguate the text using a context of the text, which may be identified from one of the following contexts: domain; user history; and specified context.

After applying the selection to the text, the method may refine a knowledge base of each component in the selection utilizing the disambiguated sense (or senses).

In the method, at least one of the selection of components provides results only for coarse senses.

In the method, results of the selection of components may be combined into one result utilizing a merging algorithm.

In the method, the merging process may utilize a first stage comprising merging of coarse senses, and a second stage comprising merging of fine senses within each coarse sense grouping.

In the method, the merging process may utilize a weighted sum of probability distributions, and the weights may be the confidence score associated with the distribution. Further, the merging process may comprise a weighted average of confidence scores, and the weights are again the confidence scores associated with the distribution.
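The merging described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the embodiment: the function name, the input shape (a list of confidence and distribution pairs), the sense labels and the normalisation by the total confidence are all assumptions made for the example.

```python
from collections import defaultdict


def merge_distributions(results):
    """Merge per-component results into one distribution and one confidence.

    `results` is a list of (confidence, {sense: probability}) pairs,
    one pair per disambiguation component (hypothetical input shape).
    """
    total_conf = sum(conf for conf, _ in results)
    if total_conf == 0:
        return {}, 0.0  # no component is confident; nothing to merge

    # Weighted sum of the distributions; the weights are the confidence scores.
    merged = defaultdict(float)
    for conf, dist in results:
        for sense, prob in dist.items():
            merged[sense] += conf * prob

    # Assumption: divide by the total weight so the probabilities sum to 1.
    merged = {sense: p / total_conf for sense, p in merged.items()}

    # Weighted average of the confidence scores, weighted by themselves.
    merged_conf = sum(conf * conf for conf, _ in results) / total_conf
    return merged, merged_conf


# Two hypothetical components voting on two senses of one word.
results = [
    (0.9, {"bank/n1": 0.8, "bank/n2": 0.2}),
    (0.3, {"bank/n1": 0.4, "bank/n2": 0.6}),
]
merged, confidence = merge_distributions(results)
```

With these inputs the more confident component dominates: the merged probability of the first sense is about 0.7 and the merged confidence is about 0.75.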

In another aspect, a method of processing natural language text utilizing disambiguation components to identify a disambiguated sense for the text is provided. The method comprises steps of: defining an accuracy target for disambiguation; and applying a selection of components from the plurality of disambiguation components to meet the accuracy target.

In another aspect, a method of processing natural language text utilizing disambiguation components to identify a disambiguated sense for the text is provided. The method comprises steps of: identifying a set of senses for the text; and identifying and removing an unwanted sense from the set.

In another aspect, a method of processing natural language text utilizing disambiguation components to identify a disambiguated sense for the text is provided. The method comprises steps of: identifying a set of senses for the text; and identifying and removing an amount of ambiguity from the set of senses.

In a second aspect, a method of generating sense-tagged text is provided. The method comprises steps of: disambiguating a quantity of documents utilizing a disambiguation component; generating a confidence score and a probability score for a sense identified for a word provided by the component; if the confidence score for the sense for the word is below a set threshold, the sense is ignored; and if the confidence score for the sense for the word is above the set threshold, the sense is added to the sense-tagged text.
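The thresholding step of this aspect might look as follows in Python. The sketch is illustrative only: the function name, the tuple layout of the component output and the example values are assumptions, and the text does not say how a score exactly equal to the threshold is handled.

```python
def build_sense_tagged_text(candidates, threshold):
    """Keep only the senses whose confidence score exceeds `threshold`.

    `candidates` is a list of (word, sense, confidence, probability)
    tuples as a disambiguation component might emit them (hypothetical shape).
    """
    tagged = []
    for word, sense, confidence, probability in candidates:
        if confidence > threshold:
            tagged.append((word, sense))
        # Otherwise the sense is ignored. Treating a score exactly equal to
        # the threshold as "ignored" is an assumption; the text leaves it open.
    return tagged


# Hypothetical component output for two words.
candidates = [
    ("bank", "financial institution", 0.92, 0.80),
    ("interest", "legal right", 0.40, 0.50),
]
tagged = build_sense_tagged_text(candidates, threshold=0.75)
```

Only the high-confidence tag for “bank” survives; the low-confidence tag for “interest” is dropped rather than added to the sense-tagged text.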

In other aspects, various combinations of sets and subsets of the above aspects are provided.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other aspects of the invention will become more apparent from the following description of specific embodiments thereof and the accompanying drawings which illustrate, by way of example only, the principles of the invention. In the drawings, where like elements feature like reference numerals (and wherein individual elements bear unique alphabetical suffixes):

FIG. 1 is a schematic representation of words and word senses associated with an embodiment of a text processing system;

FIG. 2 is a schematic representation of representative semantic relationships of words for use with the system of FIG. 1;

FIG. 3 is a schematic representation of an embodiment of a text processing system providing word sense disambiguation;

FIG. 4 is a block diagram of a word sense disambiguator module, control file optimizer, and database elements of the text processing system of FIG. 3;

FIG. 5 is a diagram of data structures used to represent the semantic relationships of FIG. 2 for the system of FIG. 3;

FIG. 6 is a flow diagram of a text processing process performed by the embodiment of FIG. 3;

FIG. 7 is a flow diagram of a process for a disambiguating step of the text processing process of FIG. 6;

FIG. 8 is a data flow diagram for the control file optimizer of FIG. 4; and

FIG. 9 is a flow diagram of a bootstrapping process associated with the text processing system of FIG. 3.

DESCRIPTION OF EMBODIMENTS

The description which follows, and the embodiments described therein, are provided by way of illustration of an example, or examples, of particular embodiments of the principles of the present invention. These examples are provided for the purposes of explanation, and not limitation, of those principles and of the invention. In the description which follows, like parts are marked throughout the specification and the drawings with the same respective reference numerals.

The following terms will be used in the following description, and have the meanings shown below:

Computer readable storage medium: hardware for storing instructions or data for a computer. For example, magnetic disks, magnetic tape, optically readable medium such as CD ROMs, and semi-conductor memory such as PCMCIA cards. In each case, the medium may take the form of a portable item such as a small disk, floppy diskette, cassette, or it may take the form of a relatively large or immobile item such as hard disk drive, solid state memory card, or RAM.

Information: documents, web pages, emails, image descriptions, transcripts, stored text etc. that contain searchable content of interest to users, for example, contents related to news articles, news group messages, web logs, etc.

Module: a software or hardware component that performs certain steps and/or processes; may be implemented in software running on a general-purpose processor.

Natural language: a formulation of words intended to be understood by a person rather than a machine or computer.

Network: an interconnected system of devices configured to communicate over a communication channel using particular protocols. This could be a local area network, a wide area network, the Internet, or the like operating over communication lines or through wireless transmissions.

Query: a list of keywords indicative of desired search results; may utilize Boolean operators (e.g. “AND”, “OR”); may be expressed in natural language.

Text: textual information represented in its usual form within a computer or associated storage device. Unless otherwise specified, it is assumed to be expressed in natural language.

Search engine: a hardware or software component to provide search results regarding information of interest to a user in response to text from the user. The search results may be ranked and/or sorted by relevance.

Sense-tagged text: text in which some or all of the words have been marked with a word sense or senses signifying the meaning of the word in the text.

Sense-tagged corpus: a collection of sense-tagged text for which the senses and possibly linguistic information such as part of speech tags of some or all words have been marked. The accuracy of the specification of the senses and other linguistic information must be similar to that which would be achieved by a human lexicographer. Thus, if sense-tagged text is generated by a machine, then the accuracy of word senses that are marked by the machine must be similar to that of a human lexicographer performing word sense disambiguation.

The embodiment relates to natural language processing, and in particular to processing natural language text as a step in an application which requires or can be improved by making use of the meaning of the words in the text. This process is known generally as word sense disambiguation. Applications include but are not limited to:

1. Internet search and other information retrieval applications; both in disambiguating queries to better specify the user's request, and in disambiguating documents to select more relevant results. When working with large sets of data, such as a database of documents or web pages on the Internet, the volume of available data can make it difficult to find information of relevance. Various methods of searching are used in an attempt to find relevant information in such stores of information. Some of the best known systems are Internet search engines, such as Yahoo (trademark) and Google (trademark) which allow users to perform keyword-based searches. These searches typically involve matching keywords entered by the user with keywords in an index of web pages. One reason for some difficulties encountered in performing such searches is the ambiguity of words used in natural language. Specifically, difficulties are often encountered because one word can have several meanings, and each meaning can have multiple synonyms or paraphrases. For example, “Java bean” is matched by a search engine to documents which simply contain these two words. By disambiguating “Java bean” to mean “coffee bean” instead of the “Java Bean” computer technology by Sun Microsystems, a disambiguator would allow documents about this computer technology to be excluded from the results, and would similarly allow documents concerning coffee beans to be included in the results.

2. Document classification; in allowing documents to be clustered based upon precise criteria of meaning as opposed to their textual content. For example, consider an application which automatically sorted email messages into folders each pertaining to a topic specified by a user. One such folder might be entitled “programming tools”, and contain any emails that mentioned any form of “programming tool”. The use of word sense disambiguation in this application would allow emails that contained related information, but did not contain words matching the title of the folder, to be accurately classified as belonging in the folder or not. For example, an email containing the words “Java object” could be placed in the folder because it contains a sense of “Java” meaning a programming language, whereas an email containing the terms “Java coffee” or “tools to use in designing a conference program” could be rejected because, in the first case, the word “Java” is disambiguated to mean a type of coffee, and, in the second case, the word “program” refers to an event, which is a meaning not associated with computer programming. Such an effect could be optionally achieved by giving the senses present in a disambiguated email to a machine learning algorithm, rather than just providing the words as is currently done by state-of-the-art applications. The accuracy of the classification would increase as a result, and the application would appear more intelligent and be more useful to the user.

3. Machine translation; in knowing the precise meanings of words before they are translated, so that the correct translation can be provided for words with multiple possible translations. For example, the word “bank” in English may translate into the French “banque” if it means “financial institution”, but “rive” if it means “river bank”. In order to perform an accurate translation of such a word, it is necessary to select a meaning. It will be recognised by those skilled in the art that a large percentage of the errors in prior art machine translation systems are made due to the selection of the wrong senses of words being translated. The addition of word sense disambiguation to such a system would improve accuracy by reducing or eliminating the errors of this type that are made by today's state-of-the-art systems.

4. Speech recognition; in allowing utterances with words or combinations of words that sound the same but are written differently to be correctly interpreted. Most speech recognition systems include a recognition component that analyses the phonetics of a phrase and outputs several possible sequences of words that could have been pronounced. For example, “I asked to people” and “I asked two people” are pronounced the same, and would both be output as possible sequences of words by such a recognition component. Most speech recognition systems then include a module which selects which of the possible word sequences is the most probable, and outputs this sequence as the result. This module typically operates by selecting the word sequence that matches most closely with word sequences that are known to be uttered. Word sense disambiguation could improve the operation of such a module by selecting the word sequence that leads to the most consistent interpretation. For example, consider a speech recognition system which generated two alternative interpretations for an utterance: “I scream in flat endings” or “Ice cream is fattening”. A word sense disambiguator would select between these two interpretations which sound the same, in exactly the same manner as it would disambiguate between two possible interpretations in text which are spelled the same.

5. Text to speech (speech synthesis); in allowing words with multiple pronunciations to be pronounced correctly. For example, “I saw her sow the seeds” and “The old sow was slaughtered for bacon” both contain the word “sow”, which is pronounced differently in each sentence. A text to speech application needs to know which interpretation applies to each word in order to correctly utter each sentence. A word sense disambiguation module could determine that the sense of “sow” in the first sentence was the verb “to sow” and in the second sentence was “a female hog”. The application would then have the information necessary to pronounce each sentence correctly.

Before describing specific aspects of the embodiment, some background on relationships between words and their word senses is provided. Referring to FIG. 1, a relationship between words and word senses is shown generally by the reference 100. As seen in this example, certain words have multiple senses. Among many other possibilities, the word “bank” may represent: (i) a noun referring to a financial institution; (ii) a noun referring to a river bank; or (iii) a verb referring to an action to save money. Similarly, the word “interest” has multiple meanings including: (i) a noun representing an amount of money payable relating to an outstanding investment or loan; (ii) a noun representing special attention given to something; or (iii) a noun representing a legal right in something.

The embodiment assigns senses to words. In particular, the embodiment defines two senses of words: coarse and fine. A fine sense defines a precise meaning and usage of a word. Each fine sense applies within a particular part of speech category (noun, verb, adjective or adverb). A coarse sense defines a broad concept associated with a word, and may be associated with more than one part of speech category. Each coarse sense contains one or more fine senses, and each fine sense belongs to one coarse sense. A word can have more than one fine and more than one coarse sense. A fine sense is classified under the coarse sense because the fine sense of the word matches the generic concept associated with the coarse sense definition. Table 1 illustrates the relationship between a word, its coarse senses and its fine senses. As an example to illustrate the distinction between fine and coarse senses, the fine senses for the word “bank” respect the distinction between the verb “to bank” as in “to bank a plane” and the noun “a bank” as in “the pilot performed a bank”, whereas these two senses are grouped together under the more general coarse sense “Manoeuvre”.

TABLE 1

Word   Coarse Sense             Fine Senses
Bank   Financial Institutions   Financial institution (Noun)
                                Building where banking is done (Noun)
                                Perform Business with a Bank (Verb)
       Ground formations        Land beside water (Noun)
                                Ridge of earth (Noun)
                                Slope in road (Noun)
       Manoeuvre                Flight manoeuvre (Noun)
                                Tip laterally (Verb)
       Gambling                 Funds held by a gambling house (Noun)
                                Act as a banker in gambling (Verb)
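To make the coarse and fine grouping concrete, the contents of Table 1 can be held in a small nested structure. The following Python sketch is illustrative only; the class name, the dictionary layout and the helper functions are assumptions and are not taken from the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineSense:
    gloss: str
    part_of_speech: str


# Contents of Table 1, keyed by word and then by coarse sense.
SENSE_INVENTORY = {
    "bank": {
        "Financial Institutions": [
            FineSense("Financial institution", "Noun"),
            FineSense("Building where banking is done", "Noun"),
            FineSense("Perform business with a bank", "Verb"),
        ],
        "Ground formations": [
            FineSense("Land beside water", "Noun"),
            FineSense("Ridge of earth", "Noun"),
            FineSense("Slope in road", "Noun"),
        ],
        "Manoeuvre": [
            FineSense("Flight manoeuvre", "Noun"),
            FineSense("Tip laterally", "Verb"),
        ],
        "Gambling": [
            FineSense("Funds held by a gambling house", "Noun"),
            FineSense("Act as a banker in gambling", "Verb"),
        ],
    }
}


def coarse_sense_of(word, gloss):
    """Return the one coarse sense that contains the given fine sense."""
    for coarse, fine_senses in SENSE_INVENTORY[word].items():
        if any(fine.gloss == gloss for fine in fine_senses):
            return coarse
    raise KeyError(gloss)


def parts_of_speech(word, coarse):
    """A coarse sense may span several part of speech categories."""
    return {fine.part_of_speech for fine in SENSE_INVENTORY[word][coarse]}
```

Here `coarse_sense_of("bank", "Tip laterally")` yields “Manoeuvre”, and `parts_of_speech("bank", "Manoeuvre")` contains both Noun and Verb, which mirrors the point that a coarse sense can cross part of speech boundaries while each fine sense does not.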

Referring to FIG. 2, example semantic relationships between word senses are shown. These semantic relationships are precisely defined types of associations between two words based on meaning. The relationships are between word senses, which are specific meanings of words. For example, a bank (in the sense of a river bank) is a type of terrain and a bluff (in the sense of a noun meaning a land formation) is also a type of terrain. A bank (in the sense of river bank) is a type of incline (in the sense of grade of the land). A bank in the sense of a financial institution is synonymous with a “banking company” or a “banking concern.” A bank is also a type of financial institution, which is in turn a type of business. A bank (in the sense of financial institution) is related to interest (in the sense of money paid on investments) and is also related to a loan (in the sense of borrowed money) by the generally understood fact that banks pay interest on deposits and charge interest on loans.

It will be understood that there are many other types of semantic relationships that may be used. Although known in the art, the following are some examples of semantic relationships between words: Words which are in synonymy are words which are synonyms to each other. A hypernym is a relationship where one word represents a whole class of specific instances. For example, “transportation” is a hypernym for a class of words including “train”, “chariot”, “dogsled” and “car”, as these words provide specific instances of the class. Meanwhile, a hyponym is a relationship where one word is a member of a class of instances. From the previous list, “train” is a hyponym of the class “transportation”. A meronym is a relationship where one word is a constituent part of, the substance of, or a member of something. For example, for the relationship between “leg” and “knee”, “knee” is a meronym to “leg”, as a knee is a constituent part of a leg. Meanwhile, a holonym is a relationship where one word is the whole of which a meronym names a part. From the previous example, “leg” is a holonym to “knee”. Any semantic relationships that fall into these categories may be used. In addition, any known semantic relationships that indicate specific semantic and syntactic relationships between word senses may be used.
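The relationships above come in inverse pairs (hypernym and hyponym, meronym and holonym), while synonymy is its own inverse. A minimal Python sketch of that pairing follows; the triple layout and the names `INVERSE` and `invert` are hypothetical and serve only to illustrate the definitions.

```python
# Each relation type mapped to the type that holds in the opposite direction.
INVERSE = {
    "hypernym": "hyponym",
    "hyponym": "hypernym",
    "meronym": "holonym",
    "holonym": "meronym",
    "synonym": "synonym",
}


def invert(relation):
    """Turn (a, kind, b), read as "a is a <kind> of b", into the inverse."""
    subject, kind, obj = relation
    return (obj, INVERSE[kind], subject)


# "train" is a hyponym of "transportation"; "knee" is a meronym of "leg".
train_relation = ("train", "hyponym", "transportation")
knee_relation = ("knee", "meronym", "leg")
```

Inverting the two example triples gives “transportation” as a hypernym of “train” and “leg” as a holonym of “knee”, matching the examples in the text.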

It will be recognized that use of word sense disambiguation in a search engine addresses the problem of retrieval relevance. Furthermore, users often express text as they would express language. However, since the same meaning can be described in many different ways, users encounter difficulties when they do not express text in the same specific manner in which the relevant information was initially classified.

For example, if the user is seeking information about “Java” the island, and is interested in “holidays” on Java (island), the user would not retrieve useful documents that had been categorized using the keywords “Java” and “vacation”. The embodiment addresses this issue. It has been recognized that deriving precise synonyms and sub-concepts for each key term in a naturally expressed text increases the volume of relevant retrievals. If this were performed using a thesaurus without word sense disambiguation, the result could be worsened. For example, semantically expanding the word “Java” without first establishing its precise meaning would yield a massive and unwieldy result set with results potentially selected based on word senses as diverse as “Indonesia” and “computer programming”. The embodiment provides systems and methods of interpreting the meaning of each word; the interpreted words are then semantically expanded to produce a comprehensive and simultaneously more precise result set.

Referring to FIG. 3, a text processing system associated with an embodiment is shown generally at reference 10. The system takes as input a text file 12. The text file 12 contains natural language text, such as a query, a document, the output of a speech to text system, or any source of natural language text in electronic form.

The system includes text processing engine 20. The text processing engine 20 may be implemented as dedicated hardware, or as software operating on a general purpose processor. The text processing engine may also operate on a network.

The text processing engine 20 generally includes a processor 22. The engine may also be connected, either directly thereto, or indirectly over a network or other such communication means, to a display 24, an interface 26, and a computer readable storage medium 28. The processor 22 is coupled to the display 24 and to the interface 26, which may comprise user input devices such as a keyboard, mouse, or other suitable devices. If the display 24 is touch sensitive, then the display 24 itself can be employed as the interface 26. The computer readable storage medium 28 is coupled to the processor 22 for providing instructions to the processor 22 to instruct and/or configure processor 22 to perform steps or algorithms related to the operation of text processing engine 20, as further explained below. Portions or all of the computer readable storage medium 28 may be physically located outside of the text processing engine 20 to accommodate, for example, very large amounts of storage. Persons skilled in the art will appreciate that various forms of text processing engines can be used with the present invention.

Optionally, and for greater computational speed, the text processing engine 20 may include multiple processors operating in parallel or any other multi-processing arrangement. Such use of multiple processors may enable the text processing engine 20 to divide tasks among various processors. Furthermore, the multiple processors need not be physically located in the same place, but rather may be geographically separated and interconnected over a network as will be understood by those skilled in the art.

Text processing engine 20 includes a database 30 for storing a knowledge base and component linguistic resources used by the text processing engine 20. The database 30 stores the information in a structured format to allow computationally efficient storage and retrieval as will be understood by those skilled in the art. The database 30 may be updated by adding additional keyword senses or by referencing existing keyword senses to additional documents. The database 30 may be divided and stored in multiple locations for greater efficiency.

A central component of text processing engine 20 is word sense disambiguation (WSD) module 32, which processes words from an input document or text into word senses. A word sense is a given interpretation ascribed to a word, in view of the context of its usage and its neighbouring words. For example, the word “book” in the sentence “Book me a flight to New York” is ambiguous, because “book” can be a noun or a verb, each with multiple potential meanings. The result of processing of the words by the WSD module 32 is a disambiguated document or disambiguated text comprising word senses rather than ambiguous or uninterpreted words. WSD module 32 distinguishes between word senses for each word in the document or text. WSD module 32 identifies which specific meaning of the word is the intended meaning using a wide range of interlinked linguistic techniques to analyze the syntax (e.g. part of speech, grammatical relations) and semantics (e.g. logical relations) in context. It may use a knowledge base of word senses which expresses explicit semantic relationships between word senses to assist in performing the disambiguation.

Referring to FIG. 4, further detail on database 30 is provided.

To assist in disambiguating words into word senses, the embodiment utilizes knowledge base 400 of word senses capturing relationships of words as described above for FIG. 2. Knowledge base 400 is associated with database 30 and is accessed to assist WSD module 32 in performing word sense disambiguation as well as to provide the inventory of possible senses of words in a text. While prior art dictionaries, and lexical databases such as WordNet (trademark), have been used in systems, knowledge base 400 provides an enhanced inventory of words, word senses, and semantic relations. For example, while prior art dictionaries contain only definitions of words for each of their word senses, knowledge base 400 also contains information on relations between word senses. These relations include the definition of the sense and the associated part of speech (noun, verb, etc.), fine sense synonyms, antonyms, hyponyms, meronyms, pertainyms, similar adjectives relations and other relationships known in the art. Knowledge base 400 also contains additional semantic relations not contained in other prior art lexical databases: (i) additional relations between word senses, such as the grouping of fine senses into coarse senses, “instance of” relations, classification relations, and inflectional and derivational morphological relations; (ii) corrections of errors in data obtained from published sources; and (iii) additional words, word senses, and relations that are not present in other prior art knowledge bases.

In addition to containing an inventory of words and word senses (fine and coarse) for each word and concept, as well as over 40 specific types of semantic links between them, database 30 also provides a repository for component resources 402 used by linguistic components 502 and WSD components 504. Some component resources are shared by several components while other resources are specific to a given component. In the embodiment, the component resources include: general models, domain specific models, user models and session models. General models contain general domain information, such as a probability distribution of senses for each word for any text of unknown domain. They are trained using data from several domains. WSD components 504 and linguistic components 502 utilize these resources as necessary. For example, a component may use these resources on all requests or may use them only when the request cannot be completed using more specific models. Domain-specific models are trained from domain specific information. They are useful for modelling usage of specialized meanings of words in various domains. For example, the word “Java” has different meanings for travel agents and computer programmers. These resources allow the building of statistical models for each group. User models are trained for a specific user. The models may be given or may be learnt over time. The user models can be constructed by the application or automatically by the word sense disambiguation system. Session models provide information regarding multiple requests regrouped within a session. For example, several word sense disambiguation requests may be related to the same topic during an information retrieval session using a search engine. The session models can be constructed by the application or automatically by WSD module 32.

Database 30 also contains sense-tagged corpus 404. Sense-tagged corpus 404 may optionally be split up into sub-units used for training components, training confidence functions for components and training the control file optimizer, as described further below.

Referring to FIG. 5, further detail on knowledge base 400 is provided. In the embodiment, knowledge base 400 is a generalized graph data structure and is implemented as a table of nodes 402 and a table of edge relations 404 associating two nodes together. Each is described in turn. Annotations of arbitrary data types may be attached to each node or edge. In other embodiments, other data structures, such as linked lists, may be used to implement knowledge base 400.

In table 402, each node is an element in a row of table 402. In the embodiment, a record for each node has up to the following fields: an ID field 406, a type field 408 and an annotation field 410. There are two types of entries in table 402: a word and a word sense definition. For example, the word “bank” in ID field 406A is identified as a word by the “word” entry in type field 408A. Also, exemplary table 402 provides several definitions of words. To catalog the definitions and to distinguish definition entries in table 402 from word entries, labels are used to identify definition entries. For example, the entry in ID field 406B is labeled “LABEL001”. A corresponding definition in type field 408B identifies the label as a “fine sense” word relationship. A corresponding entry in annotation field 410B identifies the label as “Noun. A financial institution”. As such, “bank” can now be linked to this word sense definition. Furthermore, an entry for the word “brokerage” may also be linked to this word sense definition. Alternate embodiments may use a common word with a suffix attached to it, in order to facilitate recognition of the word sense definition. For example, an alternative label could be “bank/n1”, where the “/n1” suffix identifies the label as a noun (n) and the first meaning for that noun. It will be appreciated that other label variations may be used. Other identifiers to identify adjectives, adverbs and others may be used. The entry in type field 408 identifies the type associated with the word. There are several types available for a word, including: word, fine sense and coarse sense. Other types may also be provided. In the embodiment, when an instance of a word has a fine sense, that instance also has an entry in annotation field 410 to provide further particulars on that instance of the word.

Edge/Relations table 404 contains records indicating relationships between two entries in nodes table 402. Table 404 has the following columns: from node ID column 412, to node ID column 414, type column 416 and annotation column 418. Columns 412 and 414 are used to link two entries in table 402 together. Column 416 identifies the type of relation that links the two entries. A record has the ID of the origin and the destination node, the type of the relation, and may have annotations based on the type. Types of relations include “root word to word”, “word to fine sense”, “word to coarse sense”, “coarse to fine sense”, “derivation”, “hyponym”, “category”, “pertainym”, “similar”, “has part”. Other relations may also be tracked therein. Entries in annotation column 418 provide a (numeric) key to uniquely identify an edge type going from a word node to either a coarse node or fine node for a given part-of-speech.
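
By way of illustration only, the node table and edge table described above may be sketched as two small record types. The sketch assumes an in-memory list for each table; the names Node, Edge and senses_of are hypothetical and do not appear in the embodiment, and the numeric edge annotation shown is an invented sample value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    node_id: str                      # ID field 406
    node_type: str                    # type field 408: "word", "fine sense", "coarse sense"
    annotation: Optional[str] = None  # annotation field 410


@dataclass(frozen=True)
class Edge:
    from_id: str                      # from node ID column 412
    to_id: str                        # to node ID column 414
    relation: str                     # type column 416
    annotation: Optional[int] = None  # annotation column 418 (sample value only)


# Sample rows taken from the "bank" example in the description.
nodes = [
    Node("bank", "word"),
    Node("brokerage", "word"),
    Node("LABEL001", "fine sense", "Noun. A financial institution"),
]
edges = [
    Edge("bank", "LABEL001", "word to fine sense", 1),
    Edge("brokerage", "LABEL001", "word to fine sense", 1),
]


def senses_of(word, edge_table, relation="word to fine sense"):
    """Return the IDs of the sense nodes linked from a word node (hypothetical helper)."""
    return [e.to_id for e in edge_table
            if e.from_id == word and e.relation == relation]
```

In this sketch both “bank” and “brokerage” resolve to the same sense node, mirroring the shared word sense definition described above.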

Referring to FIG. 4, further detail on WSD module 32 is provided. WSD module 32 comprises control file optimizer 514, iterative component sequencer (ICS) 500, linguistic components 502, and WSD components 504.

Turning first to WSD components 504 and linguistic components 502, common characteristics and features of WSD components 504 and linguistic components 502 (“components”) are now described. Results generated by a particular component are preferably rated using a probability distribution and a confidence score. The probability distribution allows a component to return a probability figure indicating the likelihood that any possible answer is correct. In the case of WSD components 504, possible answers comprise possible senses of words in the text. In the case of linguistic components 502, possible answers depend on the task being performed by the linguistic component; for example, possible answers for part-of-speech tagger 502F are the set of possible part of speech tags for each word. The confidence score provides an indication of a level of confidence of the algorithm in the probability distribution. As such, an answer having a high probability and a high confidence score indicates that the algorithm has identified a single answer as most probable and it is highly likely that the identified answer is accurate. If an answer has a high probability score and a low confidence score, then although the algorithm has identified a single answer as most probable, its confidence score indicates that it may not be correct. In the case of WSD components 504, a low confidence score may indicate that the component is lacking information that it needs to disambiguate this particular word. It is important that each component have a good confidence function. A component with a low overall accuracy but a good confidence function is able to contribute to the system accuracy despite its low overall accuracy, as the confidence function will correctly identify the subset of words for which the answers supplied by the component can be trusted.

The confidence function considers internal operating features of the component and its algorithm and evaluates potential weaknesses of accuracy of the algorithm. For example, if an algorithm relies on statistical probabilities, it would tend to produce incorrect results when probabilities were calculated from very few examples. Accordingly, for that algorithm, the confidence score will use a variable containing the number of examples used by the algorithm. A confidence function may contain several variables, even hundreds of variables. The function is usually created by using the variables as input into a classification or regression algorithm (statistical, such as a generalized linear model, or based upon machine learning, such as a neural network) familiar to those skilled in the art. The data used to train the classification or regression algorithm is preferably obtained by running the WSD algorithm over a portion of sense-tagged corpus 404 that has been set aside for this purpose.

Many of the components employ statistical techniques based on machine learning concepts or other statistical techniques which will be familiar to those skilled in the art. It will be appreciated by those skilled in the art that such components require the use of training data in order to construct their statistical models. For example, the priors component 504A utilizes many sense-tagged examples of each word in order to determine the statistically most likely sense for that particular word. In the embodiment, the training data is provided by sense-tagged corpus 404, which is known by those skilled in the art as a “training corpus”.

Further detail is now provided on features of WSD components 504. Each WSD component 504 attempts to associate the correct senses to words in text using a particular word sense disambiguation algorithm. Each WSD component 504 may run more than one time during the course of a disambiguation. The system provides semantic word data or other forms of data in database 30 that each of the algorithms needs in order to perform disambiguation. As noted earlier, each WSD component 504 has an algorithm that executes a particular type of disambiguation and generates a probability score and a confidence score with its results. The WSD components include but are not limited to: priors component 504A; example memory component 504B; n-gram component 504C; concept overlapping component 504E; heuristic word sense component 504F; frequent words component 504G; and dependency component 504H. Each component has a specialized knowledge base associated with its particular operation. Each component produces a confidence function as detailed above. Details of each component are described below. Each technique is generally known in the art, unless specific aspects are provided herein. It will also be appreciated that not all of the WSD components described in the embodiment may be necessary to accomplish accurate word sense disambiguation, but that some combination of different techniques is required.

Priors component 504A utilizes a priors algorithm to predict word senses by utilizing statistical data on the frequency of appearance of various word senses. Specifically, the algorithm assigns a probability to each word sense based on the frequency of the word sense in sense-tagged corpus 404. These frequencies are preferably stored in the component resources 402.
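
As a minimal sketch of the priors algorithm, assuming the sense-tagged corpus is available as a sequence of (word, sense) pairs, sense frequencies may be counted and normalized into a probability distribution per word. The function names train_priors and prior_distribution and the sample corpus are hypothetical and are provided for illustration only.

```python
from collections import Counter, defaultdict


def train_priors(tagged_corpus):
    """Count how often each sense was tagged for each word."""
    counts = defaultdict(Counter)
    for word, sense in tagged_corpus:
        counts[word][sense] += 1
    return counts


def prior_distribution(counts, word):
    """Return a probability for each sense of a word from its relative frequency."""
    sense_counts = counts.get(word, Counter())
    total = sum(sense_counts.values())
    if total == 0:
        return {}  # word never seen in the corpus: no prediction
    return {sense: n / total for sense, n in sense_counts.items()}


# Invented sample data: "java" tagged four times.
corpus = [
    ("java", "coffee"),
    ("java", "programming language"),
    ("java", "coffee"),
    ("java", "coffee"),
]
priors = train_priors(corpus)
java_distribution = prior_distribution(priors, "java")
```

A confidence function for such a component could reasonably take the total count for the word as one of its variables, since a distribution computed from very few examples is less reliable.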

Example memory component 504B utilizes an example memory algorithm to predict word senses for phrases (or word sequences). Preferably it attempts to predict word senses of all the words in a sequence. Phrases typically are defined as a series of consecutive words. A phrase can be two words long up to a full sentence. The algorithm accesses a list of phrases (word sequences) which provides a deemed correct sense for each word in that phrase. Preferably, the list comprises sentence fragments from sense-tagged corpus 404 that occurred multiple times where the senses for each of the fragment occurrences were identical. Preferably, when an analyzed phrase contains a word which has a sense which differs from a sense previously attributed to that word in that phrase, senses in the analyzed phrase are rejected and are not retained in the list of word sequences.

When disambiguating text, the example memory algorithm identifies whether parts of the text match the previously identified recurring sequences of words which have been retained in the list of word sequences. If there is a match, the module assigns the word senses of the sequence to the matching words in the text.
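
The example memory algorithm may be sketched as follows, assuming each training phrase is supplied as a pair of (words, senses) tuples. The function names, the min_count parameter and the sense labels are hypothetical; the sketch retains only phrases that recur with identical sense tagging and then assigns those senses wherever the phrase is found in new text.

```python
from collections import defaultdict


def build_phrase_memory(tagged_phrases, min_count=2):
    """Keep phrases seen at least min_count times with identical senses each time."""
    seen = defaultdict(list)
    for words, senses in tagged_phrases:
        seen[tuple(words)].append(tuple(senses))
    memory = {}
    for words, taggings in seen.items():
        # Reject a phrase if any occurrence was tagged differently.
        if len(taggings) >= min_count and len(set(taggings)) == 1:
            memory[words] = taggings[0]
    return memory


def apply_phrase_memory(tokens, memory):
    """Assign remembered senses to every matching word sequence in the text."""
    assigned = [None] * len(tokens)
    for words, senses in memory.items():
        n = len(words)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == words:
                assigned[i:i + n] = senses
    return assigned


# Invented sample data: "river bank" is tagged consistently, "bank account" is not.
training = [
    (("river", "bank"), ("river/n1", "bank/n2")),
    (("river", "bank"), ("river/n1", "bank/n2")),
    (("bank", "account"), ("bank/n1", "account/n1")),
    (("bank", "account"), ("bank/n2", "account/n1")),
]
memory = build_phrase_memory(training)
tagged = apply_phrase_memory(["the", "river", "bank", "flooded"], memory)
```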

N-gram component 504C utilizes an n-gram algorithm which operates over a fixed range of words and only attempts to predict the sense of a single word at a time, in contrast to the example memory algorithm. The n-gram algorithm predicts word senses for a head word by matching features immediately surrounding the word in a very narrow window. Such features include: lemma, part of speech, coarse or fine word sense, and named entity type. While the algorithm may examine n words before or following a target word, typically, n is set at two words. With n being set at 2, the algorithm utilizes a list of word pairs with a correct sense associated with each word. This list is derived from word pairs from sense-tagged corpus 404 that occurred multiple times, where the senses for each of the word pair occurrences were identical. However, when a sense of at least one word differs, such word pair senses are rejected and are not retained in the list. When disambiguating text, the algorithm matches word pairs from the text being processed with word pairs present in the list maintained by the algorithm. A match is identified when a word pair is found and the sense of one of the two words is already present in the text being processed. When a match is identified, the second word in the word pair is assigned the sense associated with it in the list.

The component resource associated with the n-gram algorithm is trained over sense-tagged corpus 404, and is part of component resources 402. The n-gram component resource includes a statistical model which identifies when an n-gram has been seen sufficiently frequently to become a valid sense predictor. Several predictors from the knowledge base may be triggered by a pattern of words. These predictors may reinforce a common sense or may actually generate multiple possible senses with a given probability distribution.

Concept overlapping component 504E has a concept overlapping algorithm which predicts a sense for words by choosing the senses which match most closely the general topic of the text segment. In the embodiment, the topic of the text segment is defined as the set of all non-removed senses for all words in the text segment, and topical similarity is assessed by comparing the topic of the text segment which is being disambiguated with the topics extracted from the sense-tagged corpus 404 for each word sense, and choosing the sense of each word with the highest such similarity. One such method of comparison is the dot-product or cosine metric. There are many other techniques for making use of topic similarity to disambiguate text, as will be familiar to those skilled in the art.
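
The cosine comparison mentioned above may be sketched as follows, assuming each topic is represented as a bag of sense labels with counts. The functions cosine and pick_sense, the sense labels and the counts are hypothetical and serve only to show the metric at work.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_sense(text_topic, sense_topics):
    """Choose the candidate sense whose corpus topic is closest to the text topic."""
    return max(sense_topics, key=lambda s: cosine(text_topic, sense_topics[s]))


# Topic of the text segment: the remaining senses of its words (invented labels).
text_topic = Counter({"coffee/n1": 1, "cup/n1": 1, "drink/v1": 1})

# Topics observed in the sense-tagged corpus for each candidate sense of "java".
sense_topics = {
    "java/beverage": Counter({"cup/n1": 3, "drink/v1": 2}),
    "java/language": Counter({"compiler/n1": 4, "code/n1": 2}),
}
chosen = pick_sense(text_topic, sense_topics)
```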

Heuristic word sense component 504F has a heuristic word sense algorithm which predicts a sense of words using human-generated rules which may use intrinsic language properties and semantic links in the knowledge base. For example, the senses “language” in terms of “a spoken human language” and “Indonesian” are related in the knowledge base by the relation “Indonesian is a language”. A sentence containing both “language” and “Indonesian” would have the word “language” disambiguated by this component. Typically, such a relation has been manually verified, thereby providing a high confidence in accuracy.

Frequent words component 504G has a frequent words algorithm which identifies the senses of the most frequently occurring words. In English, the 500 most frequently occurring words account for almost a third of the words encountered in normal text. For each of these words, a large number of training examples are available in sense-tagged corpus 404. Accordingly, it is possible to train specific sense predictors for each word using supervised machine learning methods. In the embodiment, the machine learning method used to train the component is boosting, and the features used include the words and parts of speech of the words in immediate proximity to the target word to be disambiguated. Other features and machine learning techniques may be used to accomplish the same goal, as will be familiar to those skilled in the art.

Dependency component 504H has a dependency algorithm which utilizes a sense prediction model based on the semantic dependencies in a sentence. By determining that a word is a head word in a dependency, and optionally the sense of the head word, it predicts the sense of its dependent words. Similarly, having determined that a word is a dependent and optionally the sense of the dependent word, it can predict the sense of the head word. For example, in the text fragment “drive the car”, the head word is “drive” and the dependent is “car”. Knowledge of the sense of “car” will be sufficient to predict the sense of “drive” as “drive a vehicle”.

It will be appreciated that other techniques for word sense disambiguation become available from time to time as the scientific research in the field progresses, and that such other techniques could equally be included as new WSD components within the system. It will be appreciated that a single WSD component may not be sufficient to disambiguate text with high accuracy. To address this issue, the embodiment utilizes multiple techniques to disambiguate text. The techniques described above specify an exemplary combination which is capable of performing high accuracy word sense disambiguation. Other techniques may also be used.

Turning now to linguistic components 502, each component 502 provides a text processing function which can be applied to text to determine a certain type of linguistic information. This information is then provided to the WSD components 504 for disambiguation. The operation of each of the linguistic components 502 will be familiar to one skilled in the art. The linguistic components 502 include:

Tokenizer 502A which splits input text into individual words and symbols. Tokenizer 502A processes the input text as a sequence of characters and breaks the input text into a series of tokens, where a token is the smallest sequence of characters that can form a word.

Sentence boundary detector 502B which identifies sentence boundaries in the input text. It uses rules and data (e.g., a list of abbreviations) to identify the possible sentence breaks in the input text.

Morpher 502C which identifies a lemma, i.e. a base form, of a word. In the embodiment, the lemma defines the fine sense and coarse sense inventories of the word. For example, for the inflected word “jumping” the morpher identifies its base form “jump”.

Parser 502D which identifies relationships between the words in the input text. Parser 502D identifies grammatical structures and phrases in the input text. The result of this operation is a parse tree, which is a concept very well known in the field. Some relationships include “subject of the verb” and “object of the verb”. From the phrases, a list of syntactic and semantic dependencies can later be extracted. Parser 502D also produces part of speech tags that are used to update the part of speech distribution. Parser information is also used to select possible compounds.

Dependency extractor 502J uses the parse tree to generate a list of syntactic and semantic dependencies, which will be familiar to those skilled in the art. The semantic dependencies are used by a number of other components to enhance their models. Dependencies are extracted in the following manner:

1. Parser 502D is used to generate a syntactic parse tree, including syntactic heads for each phrase.

2. Using a set of heuristics, as will be familiar to those skilled in the art, semantic heads are generated for each phrase. Semantic heads differ from syntactic heads in that the semantic rules give preference to semantically important elements (like nouns and verbs) while syntactic heads give preference to syntactically important elements like prepositions.

3. Once a semantic head (word or phrase) is identified, sister words and phrases are considered to form dependencies with the head.

Named-entity recogniser 502E identifies known proper nouns such as “Albert Einstein” or “International Business Machines Incorporated” and other multi-word proper nouns. Named-entity recogniser 502E collects tokens that form a named entity into groups and classifies each group into a category. Such categories include: person, location and artefact, as will be familiar to those skilled in the art. Named-entity categories are determined by a Hidden Markov Model (HMM) that is trained on parts of the sense-tagged corpus 404 in which the named entities have been marked. For example, in the text fragment “Today Coca-Cola announced . . . ”, the HMM will categorize “Coca-Cola” as a company (instead of an artefact) because of analysis of the surrounding words. Many techniques exist for named entity recognition, as will be familiar to those skilled in the art.

Part-of-speech tagger 502F assigns functional roles such as “noun” and “verb” to the words in the input text. Part-of-speech tagger 502F identifies a part of speech, which can be mapped to the broad parts of speech (noun, verb, adverb, adjective) relevant to disambiguating between word senses. Part-of-speech tagger 502F utilizes a trigram-based Hidden Markov Model (HMM) trained on a portion of sense-tagged corpus 404 which has been annotated with part of speech information. Many techniques exist for part of speech tagging, as will be familiar to those skilled in the art.

Compound finder 502H finds possible compounds in the input text. An example of a compound is “coffee table” or “fire truck”, which although sometimes written as two words need to be treated as a single word for the purposes of word sense disambiguation. Knowledge base 400 contains a list of compounds, which can be identified in the text. Each identified compound is given a probability which marks the likelihood that the compound was correctly formed. The probability is calculated from the sense-tagged corpus 404.
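
A compound lookup of the kind described for compound finder 502H may be sketched as follows. The sketch is restricted to two-word compounds for brevity, which is a simplifying assumption; the function find_compounds, the compound list and the probability values are hypothetical sample data and are not taken from knowledge base 400.

```python
def find_compounds(tokens, compounds):
    """Return (position, compound, probability) for each known two-word compound.

    compounds maps a pair of words to the likelihood that the pair,
    when seen together, really forms the compound.
    """
    found = []
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        if pair in compounds:
            found.append((i, " ".join(pair), compounds[pair]))
    return found


# Invented probabilities; in the embodiment they come from the sense-tagged corpus.
known_compounds = {
    ("fire", "truck"): 0.97,
    ("coffee", "table"): 0.92,
}
hits = find_compounds(["the", "fire", "truck", "stopped"], known_compounds)
```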

Turning now to ICS 500, ICS 500 controls the sequence in which linguistic components 502 and WSD components 504 are operated on text, to continually reduce the amount of ambiguity in a text being processed. It has several specific functions:

1. It coordinates extraction of required elements from text utilizing selected linguistic components 502 and provides such elements to WSD components 504 through a common interface.

2. It seeds an initial set of possible senses for each word using seeder 500A, which associates an initial set of possible senses from the knowledge base 400 to each word in the text to identify to the WSD components 504 which senses they must disambiguate between, thus providing an initial maximum level of ambiguity.

3. It invokes WSD components 504 according to an algorithm mix identified by control file 516. Activations of the selected WSD components 504 then attempt to disambiguate the text, providing probabilities and confidence scores associated with possible senses of the words in the text. Preferably, WSD components are invoked in multiple iterations.

4. It merges and integrates output from multiple components using merging module 500B and ambiguity eliminator 500C. Merger module 500B combines the outputs of all of the WSD components 504 into a single merged probability distribution and confidence score. Ambiguity eliminator 500C determines which sense ambiguity can be removed from the text based upon the output of merger module 500B.

More detailed description of the function and design of ICS 500 is provided in subsequent sections describing the operation of the process of word sense disambiguation.

The control file optimizer 514 optionally performs a training procedure which outputs a “recipe” in the form of control file 516, which contains the optimal sequence and parameters for the WSD components 504 in each iteration, and is used by ICS 500 during word sense disambiguation. More detailed description of the function and design of control file optimizer 514 is provided in a subsequent section describing the generation of an optimized control file.

Further detail is now provided on steps performed by the embodiment to process text. Referring to FIG. 6, a process to perform disambiguation of text is shown generally by reference 600. The process may be divided into four steps. The first step is to generate an optimized control file 602. This step creates a control file which is used in the step disambiguate text 606. The second step, read text 604, comprises reading in the text to be disambiguated from a file. The third step, disambiguate text 606, consists of disambiguating the text, and is the main step in the process. The fourth step, output disambiguated text 608, consists of writing the sense-tagged text to a file.

Referring to FIG. 7, further detail is now provided on the main processing step, disambiguate text 606.

Upon receiving a text to disambiguate, ICS 500 processes the text in the following manner:

1. ICS 500 passes the text through tokenizer 502A to identify the boundaries of the words and separate these from punctuation symbols that may be present in the text.

2. ICS 500 causes the syntactic features in the text to be identified by passing the text through linguistic components 502. Such features include: lemma (including compounds), part of speech, named entities and semantic dependencies. Each feature is generated with a confidence score and with a probability distribution.

3. Processed text is then provided to seeder 500A which uses lemma and part of speech generated by linguistic components 502 to identify a list of possible senses in the knowledge base 400 for each word in the text.

4. ICS 500 then applies a set of WSD components 504 independently to the input text, where specific WSD components 504 and a sequence of their execution are specified in control file 516. Each WSD component 504 disambiguates some or all of the words in the text. For senses that are disambiguated, a probability distribution and a confidence score are generated by each WSD component 504.

5. ICS 500 then performs a merging operation using merging module 500B. This module merges the results of all components for all words to generate a single probability distribution of senses and associated confidence score for each word. Prior to merging, if specified in the control file 516, ICS 500 may discard results with insufficiently high confidence, or for which the probability of the top result is insufficiently high. The merged probability distribution is the weighted sum of each remaining probability distribution, with the weight being provided by the confidence score. The merged confidence score is a weighted average of confidence values, with weights provided by the confidence score. For example, if a WSD component “A” had given “hot beverage” at 100% probability for the sense of the word “Java”, and WSD component “B” had given “programming language” at 100% probability for the same word, and both components reported the same confidence score, then the merged distribution would contain both “hot beverage” and “programming language” at 50% probability each. In order to merge the results of WSD components 504 that produce only coarse senses, the merger can optionally be run twice, once on the coarse senses and a second time over the group of fine senses associated with each coarse sense.

6. ICS 500 then performs ambiguity reduction using ambiguity eliminator 500C. The embodiment performs this process based upon the merged distribution and confidence output by merging module 500B. When a sense in the merged distribution has a deemed very high probability and high confidence, it is deemed to contain the correct sense and all other senses can be removed. For example, if a merged result indicated that the disambiguation for “java” was “coffee” with 98% probability and its confidence score was 90%, then all other senses would be excluded as being possible, and “coffee” would be the sole remaining sense. Control file 516 sets probability and confidence score thresholds for this decision point. Conversely, when one or more senses have a very low probability and high confidence score, such senses may be deemed to be improbable and are removed from the set of senses. Again, control file 516 sets probability and confidence thresholds for this decision point. This process reduces ambiguity from the input text by utilizing information provided by WSD components 504, and accordingly influences which senses are provided to WSD components 504 during subsequent iterations of disambiguation.

7. One or more further iterations of steps 4, 5 and 6 may optionally be performed. It will be appreciated that results of each subsequent iteration will likely be different from those of previous iteration(s), as WSD components 504 themselves do not predict senses which were eliminated after previous iterations. WSD components 504 make use of the reduced ambiguity as compared to the previous iteration to produce a result with a more accurate distribution and/or higher confidence score. Control file 516 identifies which set of WSD components 504 is applied on each iteration. It will be appreciated that several iterations may be performed until a sufficient number of words have been disambiguated or until the number of iterations specified in the control file 516 has been completed.
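
Steps 5 and 6 above may be sketched as follows for a single word. The sketch reads the merging rule as a confidence-weighted sum that is normalized by the total confidence so that the merged probabilities sum to one; that normalization, the function names merge and eliminate, and the threshold values shown are assumptions made for illustration (in the embodiment the thresholds are set by control file 516).

```python
def merge(results):
    """Merge per-component results for one word.

    results is a list of (distribution, confidence) pairs, where each
    distribution maps a sense to its probability.
    """
    total_conf = sum(conf for _, conf in results)
    if total_conf == 0:
        return {}, 0.0
    merged = {}
    for dist, conf in results:
        for sense, p in dist.items():
            merged[sense] = merged.get(sense, 0.0) + conf * p
    # Assumed normalization: divide the weighted sum by the total weight.
    merged = {sense: v / total_conf for sense, v in merged.items()}
    # Weighted average of the confidences, weighted by the confidences themselves.
    merged_conf = sum(conf * conf for _, conf in results) / total_conf
    return merged, merged_conf


def eliminate(merged, confidence, keep_p=0.95, drop_p=0.05, min_conf=0.8):
    """Return the senses that remain possible after ambiguity elimination."""
    if not merged or confidence < min_conf:
        return set(merged)  # not confident enough to remove anything
    top = max(merged, key=merged.get)
    if merged[top] >= keep_p:
        return {top}  # one sense is near certain: drop all others
    return {s for s, p in merged.items() if p > drop_p}  # drop improbable senses


# The "Java" example from step 5, with equal confidence for both components.
java_merged, java_conf = merge([
    ({"hot beverage": 1.0}, 0.8),
    ({"programming language": 1.0}, 0.8),
])
# The "java" example from step 6: 98% probability with 90% confidence.
remaining = eliminate({"coffee": 0.98, "programming language": 0.02}, 0.9)
```

With equal confidences the two conflicting answers merge to 50% each and both senses are kept for the next iteration, whereas the 98% result leaves “coffee” as the sole remaining sense.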

In the embodiment, the word sense disambiguation process may involve multiple iterations. Typically, in each iteration, only a portion of ambiguity can be removed without introducing a large number of disambiguation errors. Preferably, for each word that any selected WSD component 504 attempts to disambiguate, the selected WSD component 504 returns a full probability distribution over those senses which had not previously been removed. Generally, a WSD component 504 is not allowed to increase ambiguity of a text by re-submitting a sense for a word which has previously been discarded for that word. Also, each WSD component in an iteration operates independently from the others and interactions between WSD components 504 occur under the control of ICS 500 or via ambiguity removed in a previous iteration. In other embodiments, different degrees of interaction and knowledge of results between WSD components during an iteration and between iterations may be provided. It will be appreciated that due to the highly complex and unpredictable nature of such interactions, systems that include a high degree of interaction between WSD components 504 explicitly programmed into the WSD components 504 tend to be too complex to build practically. As such, the controlled interaction between WSD components 504 provided by the structure of the ICS and the independence of the WSD components 504 is a key advantage of the embodiment and invention.

The combined action of merger module 500B and ambiguity eliminator 500C is to post-process the results of several WSD algorithms 504 to reduce ambiguity in the text. The combined action of these modules is referred to as the post processing module 512. It will be appreciated that the use of a merging module 500B and an ambiguity reducer 500C as described in the embodiment is an exemplary technique in this particular embodiment only and that alternative techniques could be devised. For example, post processing module 512 may utilize a machine learning technique, such as a neural network, to merge and prune results. In this algorithm, the probability distributions and confidence scores of each algorithm are fed into a learning system, which generates a combined probability and confidence score for each sense.

In relation to the merger module 500B, other algorithms, such as voting algorithms and merging of rankings algorithms, may be used.

Referring to FIG. 8, further details are now provided on control file optimizer process 514 used to generate an optimized control file 516 providing maximum disambiguation accuracy. The process begins with a sense tagged corpus 802. In the embodiment, this sense tagged corpus is a portion of the sense tagged corpus 404 that has been set aside for the purpose of performing control file optimizer process 514. Control file optimizer 514 uses the WSD module 606 to generate a control file 516 that optimizes accuracy of the WSD module over the sense tagged corpus.

Control file optimizer 514 requires that optimization criteria are specified. A threshold is specified for either the percentage of ambiguity to be removed or the percentage accuracy of disambiguation; the control file optimizer then optimizes the control file to maximize the performance of the word sense disambiguator on one measure given the threshold for the other. It is also possible to specify a maximum number of iterations. The number of correct results or the amount of ambiguity removed is then maximized for each iteration. After the optimal combination of algorithms and thresholds for a given accuracy has been determined, the training proceeds to the next iteration. The target accuracy is lowered at each iteration, which allows the standard of results to drop gradually as the number of iterations increases. Multiple sequences of target accuracy are tested and the sequence producing the best results over the sense tagged corpus 802 is selected. Preferably, accuracy or remaining ambiguity is progressively reduced on each subsequent iteration. Example iteration accuracy sequences that are tested are:

1. 95%−>90%−>85%−>80%

2. 90%−>80%

For a given iteration and target disambiguation accuracy, the optimal list of algorithms to invoke and the associated probability and confidence thresholds of results to keep are identified by executing the following steps:

1. Invoke each WSD component 504 individually on sense-tagged corpus 802 to obtain a set of results for each component.

2. For a set of results of a WSD component 504, search the space of probability and confidence thresholds to identify thresholds which maximize performance against the optimization criteria. This is done through a search of all combinations of probability and confidence thresholds in the range of 0% to 100% in fixed step increments, such as 5%.

3. Once optimal thresholds for each WSD component 504 are identified, results of all WSD components 504 are pruned according to those thresholds and are merged using the merging module 500B as described earlier.

4. Consolidated merged results are then searched to identify probability and confidence thresholds of merged results that maximize the number of correct answers with an accuracy equal to or above the target accuracy for the iteration. This is preferably performed using the method of step 2.

5. Step 4 is repeated for each WSD component 504 that was merged, but with the results of the WSD component 504 of interest excluded. The probability and confidence thresholds that maximize the number of correct results of this result set are then identified. The difference between the number obtained in step 4 and the maximum number of correct results of this set indicates the contribution of correct unique answers of the algorithm of interest. If the contribution of a WSD component 504 is negative, this identifies the WSD component 504 as having a detrimental impact on the results. If the contribution is zero, then this identifies that the WSD component 504 is not contributing new correct results in the iteration. In either case, the WSD component 504 having the lowest negative contribution is removed from the list of WSD components 504 to be invoked in subsequent iterations.

6. Step 5 is repeated until a set number WSD components 504 that have anegative or zero contribution are identified and removed. The number maybe all WSD components 504.

7. Steps 2 through 6 are repeated but with the target accuracy for ofstep 2 modified by a small increment, e.g. 2.5% both above and thenbelow the target accuracy of the iteration.

8. The combination of WSD components 504 and the associated probabilityand confidence thresholds that resulted in the largest number of correctanswers are retained as the solution to a given iteration. Thethresholds for probability and confidence for each WSD algorithm 504 andthe ambiguity reducer 500C are written to the control file, and thetraining proceeds to the next iteration and target disambiguationaccuracy.
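By way of a non-limiting illustration, the threshold search of steps 2 and 4 and the contribution test of step 5 may be sketched as follows. The function names, the representation of a result as a (probability, confidence, is_correct) tuple, and the tie-breaking rule are assumptions made for this sketch only and are not taken from the embodiment.

```python
from itertools import product


def search_thresholds(results, target_accuracy, step=5):
    """Exhaustively search probability/confidence thresholds (in percent).

    results: list of (probability, confidence, is_correct) tuples, with both
    scores expressed in percent. This layout is an assumption of the sketch.
    Returns (n_correct, prob_threshold, conf_threshold) for the thresholds that
    keep the most correct results while meeting the target accuracy, or None
    if no threshold pair meets the target.
    """
    best = None
    grid = range(0, 101, step)  # 0%, 5%, ..., 100% for the default step
    for p_min, c_min in product(grid, grid):
        kept = [ok for p, c, ok in results if p >= p_min and c >= c_min]
        if not kept:
            continue
        n_correct = sum(kept)
        accuracy = n_correct / len(kept)
        # Ties keep the first (least restrictive) pair found; an assumption.
        if accuracy >= target_accuracy and (best is None or n_correct > best[0]):
            best = (n_correct, p_min, c_min)
    return best


def contribution(correct_with_all, correct_without_component):
    """Step 5: unique correct answers attributable to one component."""
    return correct_with_all - correct_without_component


# Hypothetical results of one component on a sense-tagged corpus.
example = [(90, 80, True), (85, 75, True), (60, 40, False),
           (55, 30, True), (40, 20, False)]
print(search_thresholds(example, target_accuracy=0.9))
```

A component whose `contribution` is zero or negative would, under step 5, be a candidate for removal from subsequent iterations.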

The control file optimizer 514 can be set to optimize accuracy given that each word is assigned one and only one sense, as the above description implies. It will be recognized that for certain applications or in certain specific instances, it may not make sense to attempt to assign only one sense to each word, or to disambiguate all the words.

The amount of ambiguity present in text prior to any disambiguation may be considered to be the maximum ambiguity. The amount of ambiguity present in fully sense-tagged text, for which each word has been assigned one and only one word sense, can be considered to be the minimum ambiguity. It will be recognized that for some applications or in certain cases it will be useful to remove only part of the ambiguity present in the text. This can be accomplished by allowing a word to have more than one possible sense, or by not disambiguating certain words, or both of these. In the embodiment, the percentage of ambiguity removed is defined as the (number of senses discarded), divided by the (total number of possible senses minus one). It will further be recognized that, in general, removing a smaller percentage of ambiguity permits word sense disambiguator 32 to return more accurate results, given that word sense disambiguator 32 can specify more than one possible sense for a word, and where a word is considered correctly disambiguated if the senses specified for the word include the correct sense of the word.
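As a worked illustration of the definition above, the following sketch computes the fraction of ambiguity removed for a single word. The function name and the treatment of a word with only one possible sense (reported as zero ambiguity removed) are assumptions of the sketch.

```python
def ambiguity_removed(total_senses, senses_kept):
    """Fraction of ambiguity removed for one word.

    Defined in the text as (senses discarded) / (total possible senses - 1).
    A word with a single possible sense has no ambiguity to remove; returning
    0.0 for that case is an assumption of this sketch.
    """
    if not 1 <= senses_kept <= total_senses:
        raise ValueError("senses_kept must be between 1 and total_senses")
    if total_senses == 1:
        return 0.0
    return (total_senses - senses_kept) / (total_senses - 1)


# A word with 5 possible senses, narrowed to 2 candidates:
# 3 senses discarded out of a possible 4, i.e. 75% of the ambiguity removed.
print(ambiguity_removed(5, 2))
```

Keeping all five senses gives 0 (maximum ambiguity), and keeping exactly one gives 1 (minimum ambiguity), matching the two extremes described above.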

Optionally, the control file optimizer 514 can be provided with separate optimization criteria and thresholds for the percentage of ambiguity to be removed by the word sense disambiguator 32 and the accuracy of the disambiguation results of word sense disambiguator 32. The control file optimizer 514 can be asked to either a) maximize the amount of ambiguity removed subject to a minimum threshold of accuracy (for example, remove as much ambiguity as possible, ensuring that the remaining possible senses for the words are 95% likely to contain the correct sense), or b) maximize disambiguation accuracy subject to a minimum percentage of ambiguity to remove (for example, maximize accuracy subject to removing at least 70% of additional senses for each word). This capability is useful in applications a) because it allows word sense disambiguator 32 to better fit the real world of natural language texts, in which words may be truly ambiguous (i.e. ambiguous to a human) as expressed in a text, and therefore not possible to fully disambiguate, and b) because it allows applications making use of word sense disambiguator 32 to opt for more or less conservative implementations of word sense disambiguator 32, wherein the precision of the disambiguation is lower, but fewer correct senses are discarded. This is particularly valuable, for example, in information retrieval applications for which it is critical that correct information is never discarded (e.g. due to incorrect disambiguation), even at the expense of including extraneous information (e.g. due to additional incorrect senses being present in the disambiguated text).
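The two optimization modes described above can be illustrated with a minimal sketch that chooses among already-evaluated candidate configurations. The function name, the mode labels, the dictionary keys and the candidate figures are all hypothetical and serve only to show the constrained selection.

```python
def pick_configuration(candidates, mode, floor):
    """Select a candidate configuration under one of the two modes.

    candidates: list of dicts with 'accuracy' and 'ambiguity_removed' in [0, 1]
    (an assumed layout for this sketch).
    mode 'max_ambiguity_removed': maximize ambiguity removed, accuracy >= floor.
    mode 'max_accuracy': maximize accuracy, ambiguity removed >= floor.
    Returns the chosen dict, or None when no candidate satisfies the floor.
    """
    if mode == "max_ambiguity_removed":
        constrained, objective = "accuracy", "ambiguity_removed"
    elif mode == "max_accuracy":
        constrained, objective = "ambiguity_removed", "accuracy"
    else:
        raise ValueError(f"unknown mode: {mode}")
    feasible = [c for c in candidates if c[constrained] >= floor]
    return max(feasible, key=lambda c: c[objective]) if feasible else None


# Made-up evaluation figures for three hypothetical control files.
candidates = [
    {"name": "A", "accuracy": 0.97, "ambiguity_removed": 0.55},
    {"name": "B", "accuracy": 0.95, "ambiguity_removed": 0.72},
    {"name": "C", "accuracy": 0.88, "ambiguity_removed": 0.90},
]
print(pick_configuration(candidates, "max_ambiguity_removed", 0.95)["name"])
print(pick_configuration(candidates, "max_accuracy", 0.70)["name"])
```

In this made-up data both modes select configuration B; configuration C removes more ambiguity but fails the 95% accuracy floor, and configuration A is more accurate but removes less than 70% of the ambiguity.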

Optionally, the control file optimizer 514 can be provided with a maximum number of iterations.

It will be appreciated that creating accurate confidence functions is important. A component with a poor confidence function, even a component with high accuracy, will not contribute or will contribute less than optimally to the system accuracy. This occurs in one of two ways:

1. If the confidence function tends to frequently give a low confidence value to a correct result, then merger 500A will effectively ignore this result, due to the arithmetic of the merger whereby results are weighted by the confidence score, with the net effect being as if the component had not given a result at all for that word. Thus, these correct results will be excluded from contributing to the system due to the poor confidence function.

2. On the other hand, if the confidence function gives a high confidence value to incorrect results, then the automatic training procedure will recognize that the algorithm contributes many incorrect results, and exclude it from being run.
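The first effect can be seen in a small sketch of confidence-weighted merging. The claims describe the merge as a weighted sum of probability distributions, weighted by confidence, with the merged confidence being a confidence-weighted average of the confidence scores; the normalization by total confidence, the data layout, the function name and the sense labels below are assumptions of the sketch.

```python
def merge(results):
    """Merge component outputs, weighting each distribution by its confidence.

    results: list of (distribution, confidence), where distribution maps a
    sense label to a probability and confidence lies in [0, 1].
    Returns (merged_distribution, merged_confidence).
    """
    total_conf = sum(conf for _, conf in results)
    if total_conf == 0:
        return {}, 0.0
    merged = {}
    for dist, conf in results:
        for sense, prob in dist.items():
            merged[sense] = merged.get(sense, 0.0) + conf * prob
    # Normalizing by the total confidence is an assumption of this sketch.
    merged = {sense: value / total_conf for sense, value in merged.items()}
    # Confidence-weighted average of the confidence scores themselves.
    merged_conf = sum(conf * conf for _, conf in results) / total_conf
    return merged, merged_conf


# A correct result reported with very low confidence (first component) is
# outweighed by a confident but different result (second component).
low_conf_correct = ({"river_bank": 0.9, "financial_bank": 0.1}, 0.05)
high_conf_other = ({"river_bank": 0.2, "financial_bank": 0.8}, 0.90)
distribution, confidence = merge([low_conf_correct, high_conf_other])
print(max(distribution, key=distribution.get))
```

With these made-up figures the merged distribution favours "financial_bank" even though the first component strongly preferred "river_bank", which is the "as if the component had not given a result at all" behaviour described in item 1.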

It will be appreciated that adding an algorithm with a poor confidence function to the system (for example, one which is overly optimistic and often produces incorrect results with 100% confidence) does not severely detrimentally affect the accuracy of the system, as the control file optimization procedure 514 described above will discount such results and will not execute that algorithm in further iterations of disambiguation. This provides a level of robustness to the system against the inclusion of poor WSD components.

It will be apparent to those skilled in the art that the accuracy of most WSD systems increases with the size of the training corpus but decreases with an inaccurately tagged training corpus. The addition of accurately sense-tagged text to the training corpus will usually increase the effectiveness of WSD components. In addition, most WSD components 504 require a portion of the sense-tagged corpus 404 to be set aside for the training of their confidence function. It will be appreciated that the effectiveness of the confidence function increases as the amount of sense-tagged text in the portion of the sense-tagged corpus 404 set aside for confidence function training increases.

Sense-tagged corpus 404 can be created manually by human lexicographers. It will be appreciated that this is a time consuming and expensive process, and that finding a way to generate or augment sense-tagged corpus 404 automatically would be of substantial value.

Referring to FIG. 9, the embodiment also provides a system and method for automatically providing a sense-tagged corpus 404 or for automatically increasing the size of sense-tagged corpus 404 for the training of WSD components 504. There are two processes illustrated in FIG. 9. The first is the component training process 960. This process uses sense tagged text 404 or untagged text 900 as an input to the WSD component training module 906 in order to generate improved component resources for the WSD components 504. The second process is the corpus generation process 950. This process processes untagged text 900 or partially tagged text 902 through the WSD module 32. Using the confidence function and probability distributions output by the WSD process 32, senses which are likely to be incorrectly tagged are then filtered out by the filter module 904. This partially sense tagged text can then be added to the partially tagged text 902 or the sense tagged corpus 404. When these two processes, component training process 960 and corpus generation process 950, are run alternately, the effect is to improve the accuracy of the WSD module 32 and to increase the size of the sense-tagged corpus 404.

As described above, it will be recognized that most conceivable WSD components 504 require a training process to be performed over a sense tagged corpus 404 before they can be used to disambiguate text. For example, priors component 504A requires that the frequencies of senses be recorded from a sense tagged corpus 404. These frequencies are stored in the WSD component resources 402. As described above, the more sense tagged text 404 is available to the training process, the more accurate each WSD algorithm 504 will be. The training processes of all WSD components 504 are collectively referred to in FIG. 9 as the WSD component training process 960.

As described above, results of several WSD components 504 are combined to disambiguate previously unseen text. This is a process known as “bootstrapping”.

With the embodiment, only results with sufficiently high confidence are added to the training data, utilizing the following algorithm:

1. Train each model of each word sense disambiguation component using the component training process 960 and the available training data from the sense tagged corpus 404.

2. Disambiguate a large quantity of untagged documents 900 using the WSD module 32; preferably a very large quantity of documents from various domains is used.

3. In the filter module 904, discard all results where the result is ambiguous or where the confidence is below a threshold, which may be adjusted.

4. Add the non-discarded senses to the sense tagged data 404.

5. Re-train the set of word sense disambiguation components using the component training process 960.

6. Restart the training over the same documents, which are now in the sense tagged corpus 404, or over a new body of untagged text 900.
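The steps above may be sketched, purely for illustration, as the following loop. The `train` and `disambiguate` functions are toy stand-ins (a per-word sense frequency count) for the component training process 960 and the WSD module 32; they, the data layout, and the default threshold are assumptions of the sketch rather than the embodiment's components.

```python
from collections import Counter, defaultdict


def train(tagged):
    """Toy stand-in for training: count sense frequencies per word."""
    counts = defaultdict(Counter)
    for word, sense in tagged:
        counts[word][sense] += 1
    return counts


def disambiguate(model, word):
    """Toy stand-in for disambiguation: returns (sense, confidence) or None."""
    if word not in model:
        return None
    sense, n = model[word].most_common(1)[0]
    return sense, n / sum(model[word].values())


def bootstrap(tagged, untagged, threshold=0.8, rounds=2):
    """Grow the tagged corpus using only high-confidence results."""
    tagged = list(tagged)
    for _ in range(rounds):
        model = train(tagged)                       # steps 1 and 5
        remaining = []
        for word in untagged:                       # step 2
            result = disambiguate(model, word)
            if result is not None and result[1] >= threshold:
                tagged.append((word, result[0]))    # steps 3 and 4: keep
            else:
                remaining.append(word)              # step 3: discard
        untagged = remaining                        # step 6: next pass
    return tagged, untagged


seed = [("bank", "finance")] * 9 + [("bank", "river"),
                                    ("cup", "vessel"), ("cup", "golf")]
grown, left = bootstrap(seed, ["bank", "cup", "green"])
print(len(grown), left)
```

In this toy run only "bank" clears the confidence threshold and is added to the tagged data; "cup" (evenly split between two senses) and "green" (unseen) are left untagged rather than added with a doubtful sense.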

A key to this process is the use of a probability distribution and confidence score. In prior art systems, a confidence score is not available and inaccurate results cannot be discarded. As a result, the WSD components 504 are less accurate after retraining on the enlarged sense tagged corpus 404 than they were before, and such a process is not practically useful. By setting a high confidence threshold that rejects most incorrect senses from being added to the sense tagged corpus 404, the embodiment eliminates this deficiency in the prior art system and allows the training data to be enlarged with high quality tagged text. It will be appreciated that this process can run multiple times, and may create a self-reinforcing loop that increases both the size of the sense tagged corpus 404 and the accuracy of the WSD system 32. The quality of the training data extracted (due to the use of a probability distribution and a confidence score) and the potentially self-reinforcing nature of the bootstrapping process are features of the embodiment.

The embodiment also provides a variant of the above bootstrapping process to train the system for a specific domain (e.g., law, health, etc.), utilizing the following variation on the algorithm:

1. A number of documents are disambiguated by a highly accurate method, such as manually by a skilled human. Use of these documents provides “seeding resources” to the system, which are added to the sense tagged corpus 404.

2. The word sense disambiguation components are trained using the WSD component training process 960.

3. A large quantity of documents from the domain are automatically disambiguated and added to the sense tagged corpus 404 using the corpus generation process 950.

It will be apparent that the embodiment has several advantages over the prior art. Some include:

1. Multiple independent algorithms. The embodiment allows more components to be incorporated utilizing a simplified interface through ICS 500. As such, several disambiguation techniques (for example between 10 and 20) can be used without the system becoming too complex to manipulate.

2. Confidence functions. In prior art systems, a confidence score is not available. The confidence score provides several critical advantages over prior art systems:

a) Merging together of results of multiple components. The confidence function allows results from different probabilistic algorithms to be combined with different weights reflecting the expected accuracy of the algorithm in a particular situation. Using the confidence function described above, the system can merge together decisions of many components to obtain a more likely sense.

b) Discarding poor results or word senses for truly ambiguous words. It allows potentially inaccurate results to be discarded, such that the embodiment can opt not to provide senses for words for which it has little confidence in its answer. This better reflects the real world of natural language expression, wherein some expressions remain ambiguous even when analyzed by a human.

c) Bootstrapping. The confidence function provides a likelihood that each answer is correct. This allows only highly accurate results to be kept and reused as training text for components and the overall system. Additional training text in turn further improves the accuracy of the components and the overall system. This is a highly accurate form of bootstrapping, and offers a comparable gain in performance to sense-tagging additional training text using human lexicographers, at a tiny fraction of the cost. The amount of sense-tagged text that can be generated from untagged text (for example, the Internet) with this technique is limited only by available computer capacity. Prior art systems have performed bootstrapping without a confidence score, but the sense tags in the text fed to the system are far less accurate than those provided by a human lexicographer or a confidence-score enabled system, and the overall performance of the system quickly stagnates or degrades.

3. Iterative disambiguation. The system allows a component to have multiple passes over the text being disambiguated, which allows it to use high-accuracy disambiguations (or reductions in ambiguity) provided by any of the other components, to improve its accuracy in disambiguating the remaining words. For example, when faced with the words “cup” and “green” in one sentence, a particular WSD component 504 may not be able to distinguish between the “golf cup” sense of “cup” and the more mundane “drinking vessel” sense. If another WSD component 504 is able to disambiguate the word “green” into its “golf green” sense, then the first WSD component 504 may now be able to correctly disambiguate “cup” into “golf cup”. In this sense, WSD components 504 interact with each other to arrive at more likely senses.

4. Method for automatically tuning WSD module 32. WSD module 32 includes a method for merging an optimal “recipe” of components and parameter values. This merged set is optimal in the sense that it provides the parameters which utilise multiple iterations of multiple components to obtain the maximum possible accuracy.

5. Multiple levels of ambiguity. By operating simultaneously on coarse and fine senses, the embodiment can integrate different components effectively. For example, several classes of linguistic components operate by attempting to discern a topical content of text. These types of components tend to have poor accuracy over fine senses, since these often respect grammatical rather than semantic distinctions, but do very well over coarse senses. The WSD module 32 is capable of merging results between components that give fine and coarse senses, allowing each component to operate over the sense granularity most appropriate for that component. Furthermore, an application that requires only coarse senses can obtain these from WSD module 32. Due to their coarseness, these coarse senses will have higher accuracy than the fine senses.

6. Use of domain-specific data. If information about the problem domain is known, the embodiment can be biased to favour senses which match the problem domain. For example, if it is known that a particular document falls within the domain of Law, then WSD module 32 can provide sense distributions to the components which favour those terms in the legal domain.

7. Gradual reduction in ambiguity. It will be appreciated that prior art systems perform disambiguation by attempting to choose one single sense for each word in a single iteration, which amounts to removing all ambiguity at once. This decreases the accuracy of the disambiguation. The embodiment instead performs this process gradually, removing some of the ambiguity at each iteration.

Optionally, the embodiment uses metadata. For example, the title of the document can be used to aid in the disambiguation of the document's text, by allowing the words in the title to carry disproportionate weight towards the disambiguation.

Although the invention has been described with reference to certain specific embodiments, various modifications thereof will be apparent to those skilled in the art without departing from the scope of the invention as outlined in the claims appended hereto. A person skilled in the art would have sufficient knowledge of at least one or more of the following disciplines: computer programming, machine learning and computational linguistics.
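Items 6 and 7 above can be illustrated together with a minimal sketch: a sense distribution is first biased toward a known domain and then narrowed over successive iterations instead of being collapsed to one sense at once. The multiplicative boost, the pruning rule (drop senses below a fraction of the top probability), the function names and the sense labels are all assumptions of the sketch; in the embodiment the components would be re-run between iterations, which this sketch omits.

```python
def bias_toward_domain(distribution, domain_senses, boost=2.0):
    """Multiply the probability of in-domain senses by `boost`, renormalize."""
    weighted = {sense: prob * (boost if sense in domain_senses else 1.0)
                for sense, prob in distribution.items()}
    total = sum(weighted.values())
    return {sense: w / total for sense, w in weighted.items()}


def prune_step(distribution, keep_ratio):
    """Remove part of the ambiguity: drop senses whose probability is below
    keep_ratio times the most likely sense, then renormalize the rest."""
    top = max(distribution.values())
    kept = {s: p for s, p in distribution.items() if p >= keep_ratio * top}
    total = sum(kept.values())
    return {s: p / total for s, p in kept.items()}


# Hypothetical senses of "bar" in a document known to be legal text.
senses = {"bar.legal": 0.25, "bar.pub": 0.40, "bar.rod": 0.25, "bar.music": 0.10}
biased = bias_toward_domain(senses, {"bar.legal"}, boost=3.0)

# Two iterations with progressively stricter pruning.
after_first = prune_step(biased, keep_ratio=0.2)
after_second = prune_step(after_first, keep_ratio=0.5)
print(sorted(after_first), sorted(after_second))
```

With these made-up figures the first iteration discards only the least likely sense and the second leaves two candidates, so ambiguity is reduced in stages while the in-domain sense remains the most likely.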

1. A method of processing natural language text utilizing a plurality of disambiguation components to identify a disambiguated sense for said text, said method comprising steps of: applying a selection of components from said plurality of disambiguation components to said text to identify a local disambiguated sense for said text, wherein each component of said selection provides a local disambiguated sense of said text with a confidence score and a probability score; and said disambiguated sense is determined utilizing a selection of local disambiguated senses from said selection.
2. The method of processing natural language text as claimed in claim 1, wherein said selection of components are sequentially activated and controlled by a central module.
3. The method of processing natural language text as claimed in claim 2, further comprising identifying a second selection of components from said plurality of components; applying said second selection to said text to refine said disambiguated sense, wherein each component of said second selection provides a second local disambiguated sense of said text with a second confidence score and a second probability score; and said disambiguated sense is determined utilizing a selection of second local disambiguated senses from said second selection.
4. The method of processing natural language text as claimed in claim 3, further comprising after applying said selection to said text and prior to applying said second selection to refine said disambiguated sense, eliminating a sense from said disambiguated sense having a confidence score below a threshold.
5. The method of processing natural language text as claimed in claim 4, wherein when a particular component of said plurality of components is present in said selection and said second selection, at least one of its confidence and probability scores is adjusted when applying said second selection to said text.
6. The method of processing natural language text as claimed in claim 4, wherein said selection and said second selection of components are identical.
7. The method of processing natural language text as claimed in claim 4, wherein said confidence score of said each component is generated by a confidence function utilizing a trait of each component.
8. The method of processing natural language text as claimed in claim 4, wherein after applying said selection of components to said text to identify a local disambiguated sense for said text, said method further comprising for each said component of said selection, generating a probability distribution for its disambiguated sense; and merging all probability distributions for said selection.
9. The method of processing natural language text as claimed in claim 8, wherein said selection of components disambiguates said text using context of said text identified from one of domain; user history; and specified contexts.
10. The method of processing natural language text as claimed in claim 8, further comprising after applying said selection to said text, refining a knowledge base of each component in said selection utilizing said disambiguated sense.
11. The method of processing natural language text as claimed in claim 4, wherein at least one of said selection of components provides results only for coarse senses.
12. The method of processing natural language text as claimed in claim 4, wherein results of said selection of components are combined into one result utilizing a merging algorithm.
13. The method of processing natural language text as claimed in claim 12, wherein said process utilizes a first stage comprising merging of coarse senses, and a second stage comprising merging of fine senses within each coarse sense grouping.
14. The method of processing natural language text as claimed in claim 13, wherein said merging process utilizes a weighted sum of probability distributions, and said weights are the confidence score associated with said distribution, and wherein said merging process comprises a weighted average of confidence scores, and said weights are again the confidence scores associated with said distribution.
15. A method of generating sense-tagged text, said method comprising steps of: disambiguating a quantity of documents utilizing a disambiguation component; generating a confidence score and a probability score for a sense identified for a word provided by said component; if said confidence score for said sense for said word is below a set threshold, said sense is ignored; and if said confidence score for said sense for said word is above said set threshold, said sense is added to said sense-tagged text.
16. A method of processing natural language text utilizing a plurality of disambiguation components to identify a disambiguated sense or senses for said text, said method comprising steps of: defining an accuracy target for disambiguation; and applying a selection of components from said plurality of disambiguation components to meet said accuracy target.
17. A method of processing natural language text utilizing a plurality of disambiguation components to identify a disambiguated sense for said text, said method comprising steps of: identifying a set of senses for said text; and identifying and removing an unwanted sense from said set.
18. A method of processing natural language text utilizing a plurality of disambiguation components to identify a disambiguated sense for said text, said method comprising steps of: identifying a set of senses for said text; and identifying and removing a specified amount of ambiguity from said set of senses.